Abstract



Correction of Noisy Natural Language Text is an important and well studied 
problem in Natural Language Processing. It has a number of applications 
in domains like Statistical Machine Translation, Second Language Learning 
and Natural Language Generation. In this work, we consider some statistical 
techniques for Text Correction. We define the classes of errors commonly 
found in text and describe algorithms to correct them. The data has been 
taken from a poorly trained Machine Translation system. The algorithms use 
only a language model in the target language in order to correct the sentences. 
We use phrase based correction methods in both the algorithms. The phrases 
are replaced and combined to give us the final corrected sentence. We also 
present the methods to model different kinds of errors, in addition to results 
of the working of the algorithms on the test set. We show that one of the
approaches fails to achieve the desired goal, whereas the other succeeds well.
In the end, we analyze the possible reasons for such a trend in performance. 



Chapter 1 
Introduction 



One of the major challenges in Natural Language Processing (NLP) is the
generation of correct Natural Language sentences. The degree of correctness is
measured by how correct the sentence is syntactically and semantically. The
branch of NLP dealing with this task is known as Natural Language Generation.
However, Natural Language is like an infinite labyrinth of ambiguities, and
traveling through it is a monumental task. So far, we have no satisfactory
computational model of the human brain, one that can explain how humans truly
learn, how knowledge is stored, and how we can retrieve information extremely
fast [1]. The algorithms and models used in Natural Language Processing are
only collections of approximations that perform well on a very select set of
data. It is therefore unsurprising that such imperfect models are unable to
generate sentences with the quality of actual human-composed sentences.

The main goal of Natural Language Processing is to allow the human user
to interact with the computer using Natural Language. Imperfections in
sentence structure such as those mentioned above defeat this purpose.
However, in the absence of actual models of the human brain, at this moment
we can only try to improve the quality of the generated language. Such
Natural Language Generation is performed by a number of applications like
Machine Translation, Question Answering, Dialog Systems, Summarization
systems etc. Finding a way to improve the quality of sentences generated by
these applications can lead to a better user experience. Also, a large volume
of data is generated every day by non-native speakers of a language. If we
have at our disposal methods to improve the quality of noisy sentences, these
methods can be used as part of a standalone system to clean the data
generated by non-native speakers. That is the major motivation behind such
corrective mechanisms.

Language post-processing is performed as a part of a number of NLP tasks 
like Machine Translation, Question Answering etc. However, most of the
work done relies on the availability of resources in that particular
language, which cannot always be assumed. Here, we look into the problem
of doing these corrections using only statistical techniques and using min- 
imal language-specific resources. These methods, if successful, would be a 
great step towards solving the problem of noise-removal in resource-sparse 
languages. 

1.1 Language Modeling 

Mathematically, a Language Model [2] is expressed as a probability distribu- 
tion as follows: 

$$P(W_i \mid W_{i-1}, W_{i-2}, \ldots, W_{i-n+1})$$

where every $W_i$ is a word in that language. Language models are computed
over a corpus of a particular language. It is usually the case that the larger
and more diverse the corpus, the better the computed language model. The
definition above specifies an n-gram language model, where n is known as the
order of the model. A language model is, in fact, a Markov chain, because the
probability that a word occurs is dependent on its preceding words. These 
probability values are used to score contiguous sequences of words. Given 
a sequence of M words $S = s_1, s_2, \ldots, s_M$, the score assigned to it by
the language model is:

$$\prod_{i=1}^{M} P(s_i \mid s_{i-1}, s_{i-2}, \ldots, s_{i-n+1})$$
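
As a rough illustration of this scoring rule, the sketch below estimates a
bigram model (n = 2) by relative frequency on a toy corpus and scores a
sentence as the product of the conditional probabilities. The corpus, the
`<s>` start symbol, and the function names are assumptions made for this
example; no smoothing or backoff is applied, so any unseen bigram gives a
score of zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent  # "<s>" is an assumed start-of-sentence symbol
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def score(sentence, history_counts, bigram_counts):
    """Product of P(s_i | s_{i-1}); zero if any bigram is unseen."""
    prob = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens, tokens[1:]):
        if bigram_counts[(prev, word)] == 0:
            return 0.0
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob

# Toy corpus, invented for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
hist, bi = train_bigram(corpus)
p_fluent = score(["the", "cat", "sat"], hist, bi)   # 2/3 * 1/2 * 1/2
p_unseen = score(["dog", "ran"], hist, bi)          # "<s> dog" never occurs
```

The zero score for the second sentence is exactly the situation that backoff
and smoothing are meant to handle.
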

An important concept in language modeling is that of backoff [3]. Due to the
limitations of the corpus used to compute the language model, it is possible
that it might not capture all possible word n-grams. So, when a new n-gram
is encountered whose probability is not found in the language model, we
compute the score using a backoff procedure, which is simply an approximation
of the probability in a relaxed setting. For example, one way to back off
might be to compute the probability of the word given a smaller history, in
effect retreating to the (n - 1)-gram and so on. For example:

$$P(W \mid W_1, W_2, \ldots, W_n) \approx P(W \mid W_1, W_2, \ldots, W_{n-1}) \quad \text{(if } P(W \mid W_1, W_2, \ldots, W_n) \text{ is not present)}$$
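
A minimal sketch of this idea follows. It assumes a hypothetical probability
table keyed by `(history, word)` pairs and, following the notation above,
shortens the history by dropping its last listed word on every miss. Real
backoff schemes also multiply in a backoff weight so that the distribution
stays normalized; that weight is omitted here for brevity.

```python
def backoff_prob(word, history, ngram_probs):
    """Return P(word | history), shortening the history until a match is found."""
    history = tuple(history)
    while True:
        if (history, word) in ngram_probs:
            return ngram_probs[(history, word)]
        if not history:
            return 0.0  # the word is unknown even without any history
        history = history[:-1]  # relax the setting: condition on one word fewer

# Hypothetical table of known n-gram probabilities.
probs = {
    (("w1", "w2"), "x"): 0.5,   # full history is known
    (("w1",), "y"): 0.2,        # only the shorter history is known
    ((), "z"): 0.01,            # only the unigram is known
}

p_full = backoff_prob("x", ("w1", "w2"), probs)
p_backed_off = backoff_prob("y", ("w1", "w2"), probs)
p_unigram = backoff_prob("z", ("w1", "w2"), probs)
```
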

Another method used to compute scores for unknown n-grams is known as
smoothing [1]. In this method, we give the unknown n-grams a bit of the
probability mass after taking it away from the known n-grams. There are
many techniques available for smoothing. The most popular ones are
Kneser-Ney discounting, Witten-Bell discounting, and Good-Turing discounting.
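
To make the redistribution of probability mass concrete, the sketch below
uses add-k (Laplace) smoothing. This is a deliberately simpler technique than
the three discounting methods named above and is swapped in only for
illustration; the counts and vocabulary are invented.

```python
from collections import Counter

def add_k_prob(prev, word, bigram_counts, history_counts, vocab_size, k=1.0):
    """P(word | prev) with k pseudo-counts added to every possible bigram."""
    return (bigram_counts[(prev, word)] + k) / (history_counts[prev] + k * vocab_size)

# Invented counts: "the" was seen twice, followed once by "cat" and once by "dog".
bigram_counts = Counter({("the", "cat"): 1, ("the", "dog"): 1})
history_counts = Counter({"the": 2})
vocab = ["cat", "dog", "sat", "ran"]

p_seen = add_k_prob("the", "cat", bigram_counts, history_counts, len(vocab))
p_unseen = add_k_prob("the", "ran", bigram_counts, history_counts, len(vocab))
total = sum(add_k_prob("the", w, bigram_counts, history_counts, len(vocab)) for w in vocab)
```

The seen bigram drops from a relative frequency of 1/2 to 1/3, and the mass
taken away is what gives the unseen bigram its non-zero probability of 1/6.
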

A language model helps capture the fluency of a document, that is, it
evaluates how closely a document resembles a piece of Natural Language text.
One important thing to note here is that language models do not have anything
to do with the semantics of the language. A language model does not explicitly
tell us how correct the sentence is semantically or syntactically; it only
tells us how closely the sequence of words in the text conforms to the
sequence of words in an actual piece of text in that language.

1.2 Decoding in Machine Translation 

Statistical Machine Translation (SMT) is an active field of research in NLP.
The fundamental principle used to model SMT is the noisy channel model [5].
Let e be a sentence in the source language. Our aim is to compute the
sentence f in the target language which is the best translation for e. That
is, we wish to select the f which maximizes the probability P(f|e). The Noisy
Channel Model assumes that the sentences e and f are essentially the same.



Figure 1.1: Noisy Channel Model of Statistical Machine Translation

The sentence f has been passed through a noisy channel, which has introduced
some impurities in the sentence f and transformed it into e, as shown in
Figure 1.1.



We model the phenomenon using Bayes' rule as follows:

$$\hat{f} = \arg\max_f P(f \mid e) = \arg\max_f \frac{P(f)\,P(e \mid f)}{P(e)} = \arg\max_f P(f)\,P(e \mid f) \qquad (1.1)$$

The denominator term P(e) is removed because it is constant with respect to
f. As we can see from Equation 1.1, the right hand side has two terms. They
are defined as follows:

1. P(f) is known as the Language Model

2. P(e|f) is known as the Translation Model
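
The interplay of the two terms can be sketched as a search over a small set
of candidate translations. The probability tables below are invented toy
numbers, not the output of any trained model, and the function name is an
assumption of this example.

```python
import math

def best_translation(e, candidates, lm_prob, tm_prob):
    """Return the candidate f maximizing log P(f) + log P(e | f)."""
    return max(
        candidates,
        key=lambda f: math.log(lm_prob[f]) + math.log(tm_prob[(e, f)]),
    )

source = "das Haus ist klein"
# Language model P(f): prefers the fluent word order.
lm_prob = {"the house is small": 0.02, "small the house is": 0.0001}
# Translation model P(e | f): here it slightly prefers the disfluent candidate.
tm_prob = {
    (source, "the house is small"): 0.3,
    (source, "small the house is"): 0.4,
}

choice = best_translation(source, list(lm_prob), lm_prob, tm_prob)
```

Although the toy translation model scores the second candidate higher, the
language model term dominates the product and the fluent sentence is chosen.
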

The translation model is computed using a parallel corpus [6]. A parallel
corpus contains aligned pairs of sentences in two languages. The sentences
in one language are the translations of the sentences in the other language.
A single monolingual corpus is used to compute the language model. The
SMT system can be trained on the parallel corpus using techniques such as
the IBM models [5] and Minimum Error Rate Training [7].

The next task at hand is, given the translation and language models, to
compute the translation of a sentence in the source language. Usually, SMT
systems output the k-best list of translations, that is, the best k
translations according to the translation and the language models. This
process is known as decoding [5]. In phrase-based translation models [8], the
decoding process involves the following steps:

1. Break up the input sentence into a sequence of phrases. Each phrase
consists of a contiguous block of words.

2. Find the best replacement for each of the phrases according to the 
phrase table of the translation model. 

3. Reorder the replaced phrases appropriately using the reordering model 
which is a part of the translation model. 

Different algorithms have been proposed for decoding. Some of the frequently 
used decoders are as follows: 

1. Greedy Decoder [9]: It starts with the gloss for words and improves the
probability by taking the currently cheapest path. However, it might
get stuck in a local optimum. To avoid this pitfall, it sometimes uses a
2-step look ahead.

2. Beam Search based decoder [10]: Such a decoder starts translating 
from left to right and maintains a list of partial hypotheses. When new 
hypotheses are added to the space, it prunes out the weakest hypotheses 
in order to accommodate the fixed beam size. But the major problem 
of such a decoding algorithm is the enormous size of the search space, 
which can be exponential in the number of hypotheses. 

3. Word Graph based Decoding [11]: Such a decoder uses a search graph
over the phrases and uses hypothesis recombination to output the k- 
best list of top translations. 

4. Finite State Transducer Based Decoding: Such decoders model the
translation model as an FST with input symbols being the phrases in the
source language and output symbols being phrases in the target language.
The output of the transducer is taken as the translation of the input
sentence.

5. String to Tree model [12]: This is also known as parsing based decoding.
It uses dynamic programming, similar in nature to chart parsing.
The hypothesis space can be efficiently modeled as a forest structure. 
A recent work in this area is the use of Synchronous Context Free 
Grammars (SCFG) in parsing. The hierarchical phrase-based trans- 
lation model uses such parsing methodology. It also uses a technique 
called Cube Pruning [13] to prune the extremely large hypothesis space. 
This method has a polynomial time complexity and produces extremely 
good results. 
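
Of the decoders listed above, the beam search decoder is the easiest to
sketch. The version below is a simplified, monotone variant: it chooses one
option per position from left to right and keeps only the `beam_size` best
partial hypotheses after each step. The option lists, the bigram log
probabilities, and all function names are invented for this illustration and
do not correspond to any particular decoder implementation.

```python
import heapq

def beam_search(options_per_position, score_fn, beam_size=2):
    """Left-to-right search that prunes to the beam_size best partial hypotheses."""
    beam = [((), 0.0)]  # each hypothesis: (partial output, accumulated log score)
    for options in options_per_position:
        expanded = []
        for partial, total in beam:
            for option in options:
                expanded.append((partial + (option,), total + score_fn(partial, option)))
        # Prune the weakest hypotheses to respect the fixed beam size.
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[1])
    return beam

# Toy bigram log probabilities; None marks the start of the sentence.
bigram_logp = {
    (None, "the"): -0.5, (None, "a"): -1.0,
    ("the", "house"): -0.3, ("the", "home"): -2.0,
    ("a", "house"): -1.5, ("a", "home"): -0.4,
    ("house", "is"): -0.2, ("home", "is"): -1.0,
}

def toy_score(partial, option):
    prev = partial[-1] if partial else None
    return bigram_logp.get((prev, option), -10.0)  # heavy penalty for unseen pairs

hypotheses = beam_search([["the", "a"], ["house", "home"], ["is"]], toy_score)
best_words, best_score = hypotheses[0]
```

With a beam of two, only two of the four two-word hypotheses survive the
second step, which shows how pruning trades completeness for a bounded search.
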
Chapter 2

Previous Work and Problem 
Definition 

2.1 Literature Review 

John Truscott (2007) [14] showed via a series of linguistic experiments that
error correction increases a learner's ability to write accurately. The paper
refutes previous research stating that the benefits, if any, are very small.
He classified the corrective mechanisms into a number of groups and analyzed
the factors that probably biased the previous research endeavours. Hence,
correction of noisy text is a problem of high practical significance and,
unsurprisingly, has received a lot of attention in recent years. Of the
problems in this area, correction of spelling errors has probably been the
most thoroughly studied. The most common approach is to maintain a dictionary
of words and compute the replacement candidate by comparing the word to be
replaced with each word in the dictionary. However, this involves a lot of
computation. Several methods were proposed to alleviate this problem. For
example, Kukich (1992) [15], Zobel and Dart (1995) [16], and De Beuvron and
Trigano (1995) [17] proposed similarity key methods. In these methods, the
words in the dictionary are divided into classes according to some word
features. The input word is compared only to words in classes that have
similar features.

Another approach to correcting word errors in text is the finite state
automata based approach. Oflazer (1996) [18] suggested a method where all
words in a dictionary are treated as a regular language over an alphabet of
letters. All words are represented by a finite state automaton. For each
misspelt word, an exhaustive traversal of the dictionary automaton is
initiated, using a variant of the Wagner-Fischer algorithm [19] to control the
traversal of the dictionary. This method carefully restricts traversals in
the dictionary such that inspection of most dictionary states is avoided.
Schulz and Mihov (2002) [20] present a variant of this approach where the
dictionary is also represented as a finite state automaton. In this technique,
a finite word acceptor is constructed for each word. The acceptor accepts all
words that are within Levenshtein distance [21] k of the input word. The
dictionary automaton and the Levenshtein automaton are then traversed in
parallel to obtain the corrections for each word. Hasan et al. (2008) [22]
propose an approach where they assume that the dictionary is represented as a
deterministic finite state automaton. However, they completely avoid computing
Levenshtein distance by the use of a Levenshtein transducer. The approach
can adopt several constraints on which characters can substitute for other
characters. These constraints are computed from a phonetic and spatial
confusion matrix of characters. More recent work includes Context Sensitive
Spelling Correction, which tries to detect incorrect usage of valid words in
a certain context. Using a predefined confusion set is a common approach in
this task, for example Golding and Roth (1996) [23] and Mangu and Brill
(1997) [24]. In contrast to non-word spelling correction, in this direction
only contextual information was taken into account for modeling, by assuming
all spelling similarities are equal.
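
For reference, the Levenshtein distance that underlies several of these
methods can be computed with the Wagner-Fischer dynamic program. The sketch
below also shows the naive dictionary scan that the automaton based methods
are designed to avoid; the word list is invented and the function names are
assumptions of this example.

```python
def levenshtein(source, target):
    """Wagner-Fischer dynamic program, keeping only two rows of the table."""
    previous = list(range(len(target) + 1))  # distance from "" to each target prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from the source prefix to ""
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def corrections(word, dictionary, k):
    """Naive scan: every dictionary word within distance k of the input word."""
    return sorted(w for w in dictionary if levenshtein(word, w) <= k)

toy_dictionary = ["the", "ten", "tea", "cat"]
within_one = corrections("teh", toy_dictionary, 1)
within_two = corrections("teh", toy_dictionary, 2)
```

Note that plain Levenshtein distance counts the transposition in "teh" as two
substitutions, so "the" only appears once the threshold k is raised to 2.
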

In the context of Machine Translation, automatic postprocessing has been
studied quite widely. But such methods often target specific phenomena,
like correcting English determiners (Knight and Chander, 1994) [25], merging
German compounds (Stymne, 2009) [26], or applying word substitution
(Elming, 2006) [27]. Stymne et al. (2010) [28] also proposed another method
that uses the Swedish grammar checker Granska to evaluate errors and
postprocess Statistical Machine Translation output.

Reordering of Machine Translation output is another problem that has been 
studied quite extensively. Xia and McCord (2004) [29] describe a method for 
French, where reordering rules that operate on context-free productions are 
acquired automatically. Niessen and Ney (2004) [30] describe an approach
for translation from German to English that combines verbs with associated
particles, and also reorders questions. Collins et al. (2005) [31] also
describe an approach for reordering German clauses, whose word order differs
considerably from that of English clauses. Wang et al. (2007) [32] describe a
method of Chinese syntactic reordering for SMT systems. Once again, it
is a rule based system which specifies reordering rules for different types of
phrases in the language and also requires a lot of language specific
resources. In fact, this factor is common to all the above mentioned methods.
None of them is language independent.

Gamon et al. (2009) [33] present a system for identifying and correcting
English as Second Language (ESL) errors. In this work, they identify the
common errors made by native speakers of Chinese and Japanese who speak
English as a second language. They have a number of error specific modules,
all of which are run in parallel in order to detect errors and correct them.
Apart from the standard training data, they have also used the web as a
potential source of information. The web is used to offer corrections to text
typed in by the user and enable the user to identify whether the text
actually matches the user's intent. But the problem with this system is that
it is quite time consuming, and a person who knows very little of the
language cannot possibly benefit from this approach. Also, it was designed
specifically for errors made by native Chinese/Japanese speakers.

Schierle et al. (2007) [34] take spelling correction one step further and use
it for text cleaning. They combine the edit distance approach with pointwise
mutual information giving the neighbourhood co-occurrence information. They
use domain specific replacement strategies and abbreviation lists to increase
the correction precision compared to edit distance alone. But the problem
with their method is the requirement of a large domain specific corpus, which
is readily available for the news domain but not for every domain.

2.2 The Problem Statement 

As mentioned previously, correction of noisy Natural Language sentences has
a number of applications. In this work, we attempt this correction using a
statistical framework. The advantage of using a statistical framework is
that we do not need many language-specific resources. As a result, we can use
the same algorithm to solve the problem for different languages.

The problem we wish to solve is unsupervised correction of noisy Natural
Language sentences. To this end, we are given a large monolingual corpus
in the language which we are working with. That monolingual corpus can
be used as an indicator of the manner in which sentences are constructed
in that language. The exact formalism may vary, but we have considered
a language model formalism to improve the fluency of the sentences. An
important thing to note here is that sentences are characterized by two basic
properties:

1. Fluency : Given by the perplexity of a set of sentences according to a 
language model 

2. Faithfulness : Given by one or more metric like the BLEU score [HH] 

In this work, we do not concentrate on the BLEU score. Our aim is only 
to improve the fluency of the sentences using a language model constructed 
from the monolingual corpus. Given a noisy sentence and a monolingual 
corpus, we wish to output the sentence that has the best score according 
to the language model constructed from the corpus. However, an important
thing that we take note of here is that the semantics of the sentence
should be kept unaltered as far as possible. For that, we follow a phrase
replacement approach. We segment the sentence into a sequence of contiguous
phrases and find the replacements for the phrases that would result in
the best score according to the language model. The problem can be mapped
to a decoding problem. As previously mentioned, a decoder takes in a phrase
translation table, a distortion probability distribution, a language model,
and an input sentence and, after breaking up the sentence into phrases,
computes the best substitution for each phrase. Since we are working in a
monolingual setting, our phrase translation table is derived from the
language model itself. We do not use the distortion probability distribution.
Our algorithm works like a decoder with only the language model.

Given a sentence, finding an optimal segmentation which is in accordance 
with the cognitive process of human beings is a difficult task. As previously 
mentioned, this task is an integral part of our problem definition. Several 
theories have been proposed for this task. However, we do not wish to create 
a formalism which would be dependent on any kind of external language- 
specific resource. As a result, we do not follow any rigorous linguistic method 
of phrase segmentation. We define a phrase as a contiguous sequence of words 
in a sentence and proceed to replace individual phrases in accordance with 
this definition. The next step is combination of the replaced phrases so that 
the final sentence has the highest possible score according to the language 
model. We aim to achieve an output sentence which has a more natural 
word order than the original sentence, but we also aim to keep the meaning 
preserved as much as possible. 

In order to find replacement candidates for phrases, we need to model the 
errors. Some of the commonly observed errors in written English are [37]:

1. Singular/Plural Form

2. Verb Tense

3. Word choice

4. Subject/Verb Agreement

5. Preposition 

6. Spelling errors 

7. Word form 

8. Redundancy 

9. Missing word 

10. SVO/SOV word orders for non-native speakers of language

We can broadly classify these errors into the following categories:

1. Reordering based errors 

2. Substitution based errors 

3. Spelling errors 

4. Missing words 

The phrase replacement selection function should be such that it is able to 
accommodate all the above mentioned errors, and also be able to preserve 
the semantics of the original sentence. As a result, it is necessary to select a 
phrase that is not very different from the original phrase in terms of content. 
Also, all the above mentioned errors have to be modeled using a number of 
distance functions. The final distance function should be a combination of 
components for modeling all the above errors. Each component of the dis- 
tance function computes a similarity score between the phrase to be replaced 
and its possible replacement in accordance with the type of error which that 
component is designed to model. The final score is the combination of all 
the individual scores. 
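
One possible shape for such a combined function is sketched below. The text
does not fix how the components are combined, so the weighted sum, the two
toy components, and their names are all assumptions made for illustration.

```python
from collections import Counter

def word_overlap(original, candidate):
    """Substitution component (assumed): Jaccard overlap of the two word sets."""
    a, b = set(original), set(candidate)
    return len(a & b) / len(a | b) if a | b else 1.0

def reordering_similarity(original, candidate):
    """Reordering component (assumed): 1 if the phrases differ only in word order."""
    return 1.0 if Counter(original) == Counter(candidate) else 0.0

def combined_score(original, candidate, components, weights):
    """Weighted sum of the per-error-type similarity scores."""
    return sum(weights[name] * fn(original, candidate) for name, fn in components.items())

components = {"substitution": word_overlap, "reordering": reordering_similarity}
weights = {"substitution": 0.5, "reordering": 0.5}  # hypothetical tuning values

phrase = ["he", "school", "goes"]
reordered = combined_score(phrase, ["he", "goes", "school"], components, weights)
unrelated = combined_score(phrase, ["he", "went", "home"], components, weights)
```

A pure reordering of the phrase scores 1.0, while a candidate that changes
most of the content scores 0.1, so content-preserving replacements are
preferred.
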

Since noisy data can come from various sources, the errors that occur in
such data are varied. For example, in chat data, we can see a number of
abbreviated forms or non-standard language being used. In blogs, we see a
number of out of dictionary words being used. Such data more often than
not does not adhere to the standard syntax. Machine translation systems are
another source of noisy data. A major problem with such data is
the incorrect usage of synonyms resulting from a lack of good bitext. Also,
if the machine translation system does not have a good enough reordering
model, a number of reordering based problems can arise if the source and
target languages have different word orders. So, the distance function being
used is dependent upon the application and should be tuned accordingly.



2.3 Evaluation Metrics 

The main purpose of this work is to improve the correctness of a sentence 
given a Language Model. Since a Language Model helps measure the fluency 
of a piece of text, the main quantity being measured here is the fluency. 
However, fluency is not a sufficient metric to measure how good a piece of
text is. This is because it is possible to string together a set of meaningful
chunks in order to form a sentence. Such a sentence will get a good score
according to the Language Model, but will not count as a good sentence in
the language. As a result, we should also measure the faithfulness of the
text. Faithfulness can be quantified by how close a given piece of text is
to a human written piece of text. This is done by comparing the
machine-generated text to a piece of standard reference human-written text.

In order to quantitatively measure the correctness of sentences, we need 
numerical methods in order to judge the fluency and faithfulness of the sen- 
tences generated. Fluency is measured by the Perplexity score according to 
a language model and the faithfulness is measured by the BLEU score. 






2.3.1 Perplexity 

Perplexity helps to determine how probable a given sentence is according to a
Language Model. In other words, it reflects how a fluent speaker of the
language would order the phrases. The ordering of phrases is the key factor
here, as perplexity does not take into consideration any kind of syntactic or 
semantic rules that might govern the formation of a correct sentence in any 
language. Perplexity is purely an Information Theoretic measure. 

Say we are given a language model LM which defines a probability
distribution $P_{LM}$. Let us say that LM is an N-gram language model. Say we
have a sentence $S = s_1, s_2, \ldots, s_l$ of $l$ words. We define the
average logprob per word as follows:

$$\text{Average Logprob (LP)} = -\frac{1}{l} \sum_{i=1}^{l} \log_2 P_{LM}(s_i \mid s_{i-N+1}, \ldots, s_{i-1})$$

The average perplexity per word is defined as:

$$\text{Perplexity (PP)} = 2^{LP}$$

We can see from the definition of Perplexity that it is essentially the
effective size of the set of words from which the next word is chosen, given
that we observe the history of words spoken. This is highly dependent on the
domain of discourse, with formal texts usually having a lower value of
perplexity. As can also be seen from the definition, the Perplexity value is
improved if the Language Model assigns greater probability mass to words
which are actually used in correct order. So, if we assume that we have a
good enough Language Model, we can say that a sentence has good fluency if it
has a lower value of perplexity.
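
The two definitions translate directly into code. The sketch below assumes
that the per-word conditional probabilities have already been obtained from
the language model; the probability values are invented.

```python
import math

def perplexity(word_probs):
    """word_probs[i] is P_LM(s_i | history) for the i-th word of the sentence."""
    avg_logprob = -sum(math.log2(p) for p in word_probs) / len(word_probs)  # LP
    return 2 ** avg_logprob                                                 # PP = 2^LP

# If every word has probability 1/4, the model is in effect choosing among 4 words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A sentence whose words the model finds more likely has lower perplexity.
fluent = perplexity([0.5, 0.5, 0.25])
disfluent = perplexity([0.5, 0.01, 0.25])
```

The first value equals 4, which matches the reading of perplexity as the
size of the set from which the next word is chosen.
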

2.3.2 BLEU Score



BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the 
quality of text which has been machine-translated from one natural language 
to another. Quality is considered to be the correspondence between a ma- 
chine's output and that of a human: "the closer a machine translation is to a 
professional human translation, the better it is". BLEU was one of the first
metrics to achieve a high correlation with human judgments of quality, and 
remains one of the most popular. 

Scores are calculated for individual translated segments - generally sentences 
- by comparing them with a set of good quality reference translations. Those 
scores are then averaged over the whole corpus to reach an estimate of the 
translation's overall quality. Intelligibility or grammatical correctness are not 
taken into account. BLEU is designed to approximate human judgment at a 
corpus level, and performs badly if used to evaluate the quality of individual 
sentences. BLEU's output is always a number between 0 and 1. This value
indicates how similar the candidate and reference texts are, with values closer 
to 1 representing more similar texts. 

BLEU uses a modified form of precision to compare a candidate translation 
against multiple reference translations. The metric modifies simple precision 
since machine translation systems have been known to generate more words 
than appear in a reference text. This is illustrated in the following example
(taken from Wikipedia).

Example of Machine Translation output with high precision

Candidate: the the the the the the the
Reference 1: the cat is on the mat
Reference 2: there is a cat on the mat

Of the seven words in the candidate translation, all of them appear in the
reference translations. Thus the candidate text is given a unigram precision
of:

$$P = \frac{m}{w_t} = \frac{7}{7} = 1$$



where m is the number of words from the candidate that are found in the
reference, and $w_t$ is the total number of words in the candidate. This is a perfect
score, despite the fact that the candidate translation above retains little of 
the content of either of the references. 

The modification that BLEU makes is fairly straightforward. For each word
in the candidate translation, the algorithm takes its maximum total count,
$m_{max}$, in any of the reference translations. In the example above, the
word "the" appears twice in reference 1, and once in reference 2. Thus
$m_{max} = 2$.

For the candidate translation, the count $m_w$ of each word is clipped to a
maximum of $m_{max}$ for that word. In this case, "the" has $m_w = 7$ and
$m_{max} = 2$, thus $m_w$ is clipped to 2. $m_w$ is then summed over all
words in the candidate. This sum is then divided by the total number of words
in the candidate translation. In the above example, the modified unigram
precision score would be:

$$P = \frac{2}{7}$$
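
The clipping procedure can be checked with a short sketch. The function below
is an illustrative implementation written for this example rather than the
reference BLEU code; it takes the n-gram length as a parameter and is applied
here to the candidate and references from the example above.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a tokenized candidate against several references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    # m_max: the largest count of each n-gram in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count m_w to m_max, then sum and normalize.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the the the the the the the".split()
p_unigram = modified_precision(candidate, references, n=1)
```

For this candidate the clipped count is 2 out of 7 words, in agreement with
the value derived above.
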

The above method is used to calculate scores for a range of n-gram lengths. 
The length which has the "highest correlation with monolingual human judg- 
ments" was found to be four. The unigram scores are found to account for the 
adequacy of the translation, how much information is retained. The longer 
n-gram scores account for the fluency of the translation, or to what extent it
reads like "good English".

The modification made to precision does not solve the problem of short
translations, which can produce very high precision scores, even using
modified precision. An example of a candidate translation for the same
references as above might be:

the cat

In this example, the modified unigram precision would be,

$$P = \frac{1}{2} + \frac{1}{2} = \frac{2}{2}$$

as the word 'the' and the word 'cat' appear once each in the candidate,
and the total number of words is two. The modified bigram precision would
be 1/1 as the bigram, "the cat" appears once in the candidate. It has been
pointed out that precision is usually twinned with recall to overcome this
problem, as the unigram recall of this example would be 3/6 or 2/7. The
problem is that, as there are multiple reference translations, a bad
translation could easily have an inflated recall, such as a translation which
consisted of all the words in each of the references.

In order to produce a score for the whole corpus, the modified precision
scores for the segments are combined using the geometric mean, multiplied by
a brevity penalty to prevent very short candidates from receiving too high a
score. Let r be the total length of the reference corpus, and c the total
length of the translation corpus. If c < r, the brevity penalty applies,
defined to be $e^{(1 - r/c)}$. (In the case of multiple reference sentences,
r is taken to be the sum of the lengths of the sentences whose lengths are
closest to the lengths of the candidate sentences.)
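
A small sketch of the combination step follows. It assumes that the modified
n-gram precisions have already been computed and that they are weighted
uniformly in the geometric mean; the function names are chosen for this
example only.

```python
import math

def brevity_penalty(c, r):
    """1 when the candidate is at least as long as the reference, else e^(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(precisions, c, r):
    """Geometric mean of the modified precisions times the brevity penalty."""
    if not precisions or min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(c, r) * geo_mean

# The two-word candidate "the cat" against a six-word reference: perfect
# unigram and bigram precision, but heavily penalized for its brevity.
short_score = bleu([1.0, 1.0], c=2, r=6)
full_length_score = bleu([1.0, 1.0], c=6, r=6)
```

The short candidate receives a score of about 0.135 instead of 1, which is the
effect the brevity penalty is designed to have.
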

2.4 System Overview 

The entire experimental procedure can be divided into the following compo- 
nents: 

1. Obtaining noisy natural language sentences. 

2. Constructing the language model. 

3. Obtaining the best substitution for an input noisy sentence. 

Out of the number of sources mentioned previously from which noisy data
might be made available, we have used poor machine translation for our
experiments. In order to obtain the noisy data, we used a Statistical Machine
Translation (SMT) system trained on a very small amount of data. The
required language model for the translation system was derived using this
small amount of data. This data was fed to GIZA++ [38] to learn the
translation model. The language model was created using the SRILM language
modeling toolkit [39]. The translation model and the language model were then
fed to the MOSES [40] decoder, which translated the candidate source language
sentences into target language sentences. These translated sentences are the
noisy sentences that we attempt to correct in the next step.

Next, we use a large target language monolingual corpus in order
to construct the language model for correction. Once again, SRILM is used
to compute the language model in ARPA language model format. This language
model, along with the noisy sentences, is then passed on to the fluency
improvement algorithm in order to correct the noisy sentences.
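
To indicate how such a model file can be consumed, the sketch below reads the
n-gram sections of a file in the commonly documented ARPA layout, in which
each entry holds a base-10 log probability, the n-gram, and an optional
backoff weight. The sample content is invented and tiny, the parser ignores
the header counts, and it is not the loader used by any particular toolkit.

```python
def parse_arpa(lines):
    """Return {ngram tuple: (log10 probability, log10 backoff weight)}."""
    table, order = {}, None
    for raw in lines:
        line = raw.strip()
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # e.g. "\2-grams:" gives 2
            continue
        if order is None or not line:
            continue  # header section ("\data\", "ngram N=count") or blank line
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional; a missing weight is treated as 0.0.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        table[words] = (logprob, backoff)
    return table

# Invented miniature model in ARPA layout.
sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.30103\tthe\t-0.2
-0.60206\tcat

\\2-grams:
-0.1\tthe cat

\\end\\""".splitlines()

model = parse_arpa(sample)
```
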






Chapter 3 



Fixed Length Phrase-based 
Correction Model 

Finding an optimal phrase segmentation is an integral part of our task.
In the first approach, we try to use fixed length phrases for this purpose. The 
replacement mechanism we follow is local to the phrases themselves, and the 
replacements made to a phrase do not in any way affect the replacements 
made to any other phrase. 

3.1 Algorithm 

The first approach is the Fixed Length Phrase-based approach. In this ap- 
proach, we divide the sentence into a set of phrases of fixed length and 
compute the best substitute for each phrase using overlapping n-grams. The 
algorithm then performs a set of local replacements on each phrase. Let us 
consider an example to see how this algorithm works: 

Let the phrase be $P = p_1 p_2 \ldots p_7$.

Let the language model have order $n = 4$.

Initially, the starting index is equal to 1. So first, consider the subphrase
$p_1 p_2 p_3 p_4$.

Let the set of replacements for this subphrase be:

1. $r_1 = x_1 x_2 x_3$

2. $r_2 = y_1 y_2 y_3 y_4$

First, replace $p_1 \ldots p_4$ with $r_1$. We obtain
$P' = x_1 x_2 x_3 p_5 p_6 p_7$.
Next, make a recursive call to the function with $P'$ as the phrase to be
replaced and the start index equal to 2. Carry out these recursive calls until
the entire phrase P is replaced and compute the score. Next, backtrack up the
recursion stack and replace with the other available choices. When control
comes back to the top level function call, replace $p_1 \ldots p_4$ with $r_2$
and follow the same recursive algorithm. The sequence of recursions that gives
the best score for the replaced phrase is taken to be the replacement for the
phrase P. This process is repeated for all phrases.

Algorithm 1 Algorithm involving fixed length phrases to compute best
substitution

Require: Language model LM
Require: n = Order of LM
Require: L = Size of phrases
Require: Phrase P = W_1 W_2 ... W_L
Require: global current_best = Score_LM(P)
Require: global best_sub = P

recursive function COMPUTE_SUB(P, start_index, Score):
if start_index + n - 1 > L then
    {The end of phrase has been reached}
    if Score > current_best then
        current_best = Score
        best_sub = P
    end if
else
    kbest = FIND_K_BEST(W_{start_index} ... W_{start_index+n-1})
    for all X in kbest do
        P' = Substitute W_{start_index} ... W_{start_index+n-1} with X
        COMPUTE_SUB(P', start_index + 1, Score + Score_LM(P'))
    end for
end if

In order to find the best replacement for a phrase, we use a very simple
heuristic. We choose a phrase Q to be a possible replacement for a phrase P if
the two phrases share at least 2 common words. The algorithm is formalized
in Algorithm 1. The function FIND_K_BEST() takes a phrase as its argument
and returns all phrases in the language model that have at least 2 words in
common with the given phrase. The algorithm uses a global variable called
best_sub, which stores the best substitute recorded so far, and a global
variable called current_best, which stores its score. The steps followed for
each phrase P are as follows:

1. Initialize best_sub as the phrase P and Score as the score of P according
to the language model.

2. Call the function COMPUTE_SUB(P, 1, Score).

3. Store the value held in best_sub after step 2 as the replacement for the
phrase P.

After all the phrases have been substituted, the algorithm reconstructs the 
final sentence by simply concatenating the phrase substitutions. After it 
does so, it computes the score of the entire sentence according to the lan- 
guage model. If the score is greater than the score of the original sentence, 
it outputs the resulting sentence, otherwise outputs the original sentence. 

As can be seen, this algorithm works locally on phrases and does not consider 
the relation between two phrases based on their location in the candidate sen- 
tence. However, within phrases, it follows a continuous substitution policy 
to determine the best substitution. The substitutions within one phrase are 
overlapping. An important feature of this algorithm is the fact that it does 
not bring about any loss in the quality of sentences. Due to the fact that 
we compare the best substitution with the original sentence, the resulting 
sentences are at least as good as the original ones. 
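The recursive search described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the language model is replaced by a toy set of attested bigrams (`GOOD_BIGRAMS`), the window size is 2 instead of 4, candidates must share one word with the window instead of two, and only the fully substituted phrase is scored rather than accumulating scores along the recursion. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]

# Toy stand-in for the language model: a handful of attested bigrams (assumption).
GOOD_BIGRAMS = {("he", "went"), ("went", "to"), ("to", "school")}


def score(phrase: Phrase) -> int:
    """Toy LM score: number of adjacent word pairs that are attested bigrams."""
    return sum(1 for pair in zip(phrase, phrase[1:]) if pair in GOOD_BIGRAMS)


def find_k_best(window: Phrase) -> List[Phrase]:
    """Keep the window itself, plus every attested bigram sharing a word with it."""
    shared = [g for g in sorted(GOOD_BIGRAMS) if set(g) & set(window)]
    return [window] + [g for g in shared if g != window]


def compute_sub(phrase: Phrase, start: int, n: int, best: Dict) -> None:
    """Try every candidate for every n-word window, left to right (0-based start)."""
    if start + n > len(phrase):            # no full window left: evaluate this leaf
        s = score(phrase)
        if s > best["score"]:
            best["score"], best["phrase"] = s, phrase
        return
    for cand in find_k_best(phrase[start:start + n]):
        replaced = phrase[:start] + cand + phrase[start + n:]
        compute_sub(replaced, start + 1, n, best)


def correct_phrase(phrase: Phrase, n: int) -> Dict:
    """Start from the input itself, so the result is never worse than the input."""
    best = {"score": score(phrase), "phrase": phrase}
    compute_sub(phrase, 0, n, best)
    return best


result = correct_phrase(("he", "go", "to", "school"), n=2)
print(result)
```

Because the search starts with the input phrase as the incumbent and only accepts strictly better scores, the sketch has the same "no loss in quality" property as the algorithm in the text.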

3.2 Experiments and Results 

For our experiments, we used English as the target language. As previously
mentioned, due to the lack of data from other domains, we use data from
poor machine translation systems to test our algorithms. To test the
performance of this approach, we used the Europarl [H] Corpus. We used 1000
sentences from the Europarl English-German bitext to train the machine
translation system. 100 English sentences comprised the test set. To build
the language model, we used the entire English part of the Europarl corpus.
For perplexity computation, we used the perplexity computation module of
SRILM.

For experimentation, we used a 4-gram Language Model, i.e., n = 4. The
phrase length was chosen to be 7. The results obtained from evaluating over
the 100 sentences in the test set are summarized in Table 3.1.

                     Perplexity    BLEU
Before Correction    20.419        0.662
After Correction     20.419        0.662

Table 3.1: Results of the fixed length phrase-based correction algorithm



We can see that this algorithm does not do well. Neither the perplexity nor
the BLEU score changes, which indicates that the algorithm made no change to
the sentences in the test set. Inspecting the test set confirms this: the
output sentences were exactly the same as the input sentences.



3.3 Analysis 

An analysis of the steps of this algorithm suggests that it probably attempts
too many substitutions, and thereby tends to increase the amount of noise in
the sentences. Also, human beings do not usually attempt to make changes in
sentences in this fashion. Corrections are made on phrases. One phrase is
taken as a single unit, and some changes are made in that phrase so that the
resultant phrase resembles the actual phrase in some way, but the errors of
the original phrase are eliminated. Also, this algorithm fails to deal with
inter-phrasal correlation, for which a conditional probability among phrases
should have been used. All these factors contribute to the failure of the
algorithm.

Another factor that does not support the use of this algorithm is its time
complexity. For each phrase, the running time grows exponentially with the
number of successive substitutions, with the number of replacement candidates
as the base. In order to understand the time complexity of the algorithm, let
us model the phrase replacement procedure as a graph. Every possible phrase is
a node and there exists an edge from one node $v_i$ to another node $v_j$ if
there exists a possible substitution which can transform $v_i$ into $v_j$.
This algorithm tries to explore all possible sequences of substitutions to
transform a sentence. Hence, in the graph model, it tries to explore all
possible paths from a starting node. Since a graph can have an exponential
number of such paths, this algorithm has an exponential time complexity,
which renders it useless for almost all practical purposes.
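As a rough worked count, assume that every window yields at most $k$ replacement candidates and that replacements preserve the phrase length (both are simplifying assumptions, and $k = 5$ is chosen only for illustration). A phrase of length $L$ has $L - n + 1$ window positions for a language model of order $n$, so the number of complete substitution sequences explored is bounded as follows:

```latex
\[
  \#\text{sequences} \;\le\; k^{\,L-n+1},
  \qquad
  L = 7,\; n = 4,\; k = 5
  \;\Longrightarrow\;
  5^{\,7-4+1} = 5^{4} = 625 .
\]
```

With the values used in the experiments ($L = 7$, $n = 4$) the exponent is small, but the bound grows quickly once either the phrase length or the candidate list grows.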
Chapter 4



The Dynamic Programming 
based method 

4.1 Motivation 

The failure of the fixed length phrase-based approach led us to conclude that
this approach was only trying to increase the amount of error already present
in the sentences by trying to make too many substitutions. So we hypothesize
that an approach that takes in a phrase as one unit and corrects it in one
step after identifying the errors might be a good way to proceed. Also, the
previous approach did not have any function that explicitly models all the
previously mentioned errors. As a result, in this approach, we also modify
the function used to select the replacements for the phrases. Another problem
with the previous approach is its exponential time complexity and the fact
that it consumes much memory due to its recursive nature. We need to find a
more efficient way of selecting the best phrase segmentation of a sentence.
We also want to allow variable length phrases. All this can be accomplished
by a simple recursive formulation of the problem as follows:

Suppose we have a sentence $S = S_1 S_2 \ldots S_n$.

Suppose we wish to find the best replacements for the phrase
$S_{ij} = S_i \ldots S_j$.

Suppose $|$ denotes the concatenation operator. Then we can write
$S_{ij} = S_{il} \,|\, S_{(l+1)j}$ for any $i \le l < j$.

Hence, the task of computing $S_{ij}$ reduces to computing the best $l$ in
accordance with the above equation. Here, we define "best" as the $l$ for
which the replacements have the highest score according to the language
model. Iteratively carrying out this operation gives us the replacements for
$S_{1n} = S$.

We can easily see that this can be modeled as a bottom-up dynamic programming
problem. If we have replacement candidates for $S_{il}$ and $S_{(l+1)j}$, we
can combine them using the above equation in order to get $S_{ij}$. This
formulation is, in fact, identical to the formulation of the famed
Cocke-Younger-Kasami (CYK) algorithm [32] [33] [33] for parsing of
context-free grammars. Once we have the initial set of replacements for each
$S_{ij}$ computed using the distance function mentioned subsequently, we can
use this formulation to combine them all in order to find the best substitute
for the entire sentence. An important point to note with variable phrase
lengths is the fact that we need to explore all possible phrasal
decompositions, whose number is exponential in the length of the sentence.
However, this formulation allows us to compute the best decomposition in time
polynomial in the length of the sentence. Thus, this formulation gives us a
more efficient way of solving the problem.
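To make the gap concrete, the short sketch below contrasts the number of phrasal decompositions with the number of table cells a bottom-up method has to fill. It assumes that a decomposition is determined by choosing which of the N - 1 word boundaries to cut; the function names are hypothetical.

```python
def count_segmentations(n_words: int) -> int:
    """Ways to cut a sentence into contiguous phrases: each of the
    n_words - 1 boundaries is either a cut or not."""
    return 2 ** (n_words - 1)


def count_spans(n_words: int) -> int:
    """Number of contiguous phrases with 1 <= i <= j <= n_words,
    i.e. the number of cells in the upper triangle of the table."""
    return n_words * (n_words + 1) // 2


for n in (5, 10, 19):
    print(n, count_segmentations(n), count_spans(n))
```

The number of decompositions doubles with every additional word, whereas the number of spans grows only quadratically.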

4.2 Algorithm 

Suppose we have a sentence S which we wish to correct. Say S has N words
$S_1, \ldots, S_N$. So,
$S = S_1 S_2 \ldots S_N$

This sentence may be divided into $O(N^2)$ contiguous phrases. Each phrase
may begin at any $S_i$ and end at any $S_j$ satisfying the constraint
$1 \le i \le j \le N$. We denote a phrase beginning at $S_i$ and ending at
$S_j$ as $P_{ij}$. Let the Language Model be denoted by LM. The algorithm
initially computes a list of substitutes for each $P_{ij}$ from LM. For this
purpose, a function FIND_BEST_SUB() is called. It takes in a phrase P and
returns a list of the k best replacement phrases for P taken from LM. For
now, we treat this function as a black box. This function is discussed in
detail in the next section.



Let the list of top-k substitutes for $P_{ij}$ be denoted by $SUB_{ij}$.
$SUB_{ij}$ constitutes the initial list of substitutes taken from the Language
Model. Each entry of $SUB_{ij}$ is a 2-tuple of the form (phrase, score),
where the score of the phrase is the logprob value according to LM. We combine
these lists using a Dynamic Programming algorithm. We fill a table REP. It is
an $N \times N$ table. The entry $REP_{ij}$ consists of the top k substitutes
for $P_{ij}$, taken from $SUB_{ij}$ and from combinations of the entries
already computed for shorter phrases. $REP_{ij}$ is computed as follows:

1. $REP_{ij} = SUB_{ij}$

2. $\forall l,\ i \le l \le j - 1$, we have $k^2$ candidate substitutes for
$P_{ij}$ because there are k candidates present in each of $REP_{il}$ and
$REP_{(l+1)j}$. For each of these $k^2$ candidates, compute the score of the
phrase $S_{il} \,|\, S_{(l+1)j}$ according to LM, where
$S_{il} \in REP_{il}$ and $S_{(l+1)j} \in REP_{(l+1)j}$. Here $|$ denotes the
string concatenation operation.

3. For each of the $k^2$ phrases constructed in Step 2, add each (phrase,
score) pair to $REP_{ij}$.

4. Retain the top k entries in $REP_{ij}$ based on the scores.

The entries of $REP_{1N}$ give us the final optimum substitution for the
sentence S. The algorithm is formally stated in Algorithm 2.

The algorithm fills up an $N \times N$ table. First, it fills up the entries
corresponding to phrases of length 1, then phrases of length 2, and so on.
Let us consider a small example to see how the algorithm works:

Say we are trying to fill up $REP_{ij}$. Initialize $REP_{ij} = SUB_{ij}$.
Say we take a division at some $l$.

Let $REP_{il} = \{P_1, P_2, \ldots, P_k\}$ and
$REP_{(l+1)j} = \{Q_1, Q_2, \ldots, Q_k\}$.

The strings we add to $REP_{ij}$ are
$\{P_1Q_1, P_1Q_2, \ldots, P_1Q_k, P_2Q_1, \ldots, P_2Q_k, \ldots, P_kQ_1, \ldots, P_kQ_k\}$.

Then, we retain the top k best strings from the set $REP_{ij}$ according to
the scores assigned to the strings by the language model.

Algorithm 2 Dynamic Programming based correction algorithm

sentence S = S_1 S_2 ... S_N
phrase P_{ij} = S_i ... S_j. We have S = P_{1N}
SUB = N x N array storing the initial replacements. SUB_{ij} has
replacements for P_{ij}
REP = N x N array storing the final replacements. REP_{ij} has
replacements for P_{ij}

for all i, j with 1 <= i <= j <= N, compute SUB_{ij} = FIND_BEST_SUB(P_{ij})
for i = 1 to N do
    REP_{ii} = SUB_{ii}  //Fill up the table for all strings of length one
end for
//Length of substrings being replaced
for l = 2 to N do
    //Starting position of the string being replaced
    for i = 1 to N - l + 1 do
        //Initialize REP_{i(i+l-1)} to already computed replacements
        REP_{i(i+l-1)} = SUB_{i(i+l-1)}
        //Iterate over all possible break points
        for z = i to i + l - 2 do
            for all (p, s) in REP_{iz} do
                for all (p', s') in REP_{(z+1)(i+l-1)} do
                    q = p | p', where | is the concatenation operation
                    r = score(q)
                    REP_{i(i+l-1)} = REP_{i(i+l-1)} U (q, r)
                end for
            end for
        end for
        REP_{i(i+l-1)} = Top k (q, r) pairs having best scores from REP_{i(i+l-1)}
    end for
end for

Figure 4.1: An illustration of how the Dynamic Programming algorithm
proceeds for N = 5



Figure 4.1 gives us an illustration of the way the algorithm works. The
shaded area of the table is the one which is actually filled during the course
of the algorithm. The arrows depict the direction in which the table is filled.

The algorithm runs in polynomial time once the results of the function
FIND_BEST_SUB() have been computed. The complexity of the remaining part of
the algorithm stems from the 5 nested for loops. The three outermost loops
run in $O(N)$ time each and the two innermost loops run in $O(k)$ time each,
giving this loop nest a time complexity of $O(N^3 k^2)$. So, the overall time
complexity of this algorithm is $O(N^3 k^2)$. In addition, the function
FIND_BEST_SUB() is called $O(N^2)$ times.
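The table-filling procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `toy_score` and `toy_find_best_sub` are hypothetical stand-ins for the language model score and for FIND_BEST_SUB(), indices are 0-based, and duplicate strings are kept only once.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Scored = Tuple[str, float]                      # (phrase, score)


def correct_sentence(words: List[str],
                     find_best_sub: Callable[[str], List[Scored]],
                     score: Callable[[str], float],
                     k: int) -> List[Scored]:
    """Fill the upper triangle of REP bottom-up, shortest spans first."""
    n = len(words)
    rep: Dict[Tuple[int, int], List[Scored]] = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Start from the initial substitutes SUB_ij of the whole span.
            cands = dict(find_best_sub(" ".join(words[i:j + 1])))
            for cut in range(i, j):             # every break point
                for p, _ in rep[(i, cut)]:
                    for q, _ in rep[(cut + 1, j)]:
                        joined = f"{p} {q}"
                        if joined not in cands:
                            cands[joined] = score(joined)
            # Retain only the k best (phrase, score) pairs for this span.
            rep[(i, j)] = heapq.nlargest(k, cands.items(), key=lambda t: t[1])
    return rep[(0, n - 1)]


# Toy stand-ins (assumptions): attested bigrams act as the language model and
# a small dictionary plays the role of the retrieval step.
ATTESTED = {("the", "european"), ("european", "right")}
ALTERNATIVES = {"europe": ["european"]}


def toy_score(text: str) -> float:
    w = text.split()
    return float(sum(1 for pair in zip(w, w[1:]) if pair in ATTESTED))


def toy_find_best_sub(phrase: str) -> List[Scored]:
    options = [phrase] + ALTERNATIVES.get(phrase, [])
    return [(o, toy_score(o)) for o in options]


best = correct_sentence(["the", "europe", "right"],
                        toy_find_best_sub, toy_score, k=2)
print(best[0])
```

The three span loops and the two candidate loops correspond to the five nested loops counted in the complexity analysis above.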

A naive implementation of the function FIND_BEST_SUB() would perform a
complete linear scan of the entire list of phrases in the language model,
match the target phrase with each phrase in the list and select the best
matches. This would be very inefficient because of the extremely large number
of phrases in the language model. Hence, we need a more efficient method of
implementing this function. This implementation issue is handled in the next
section.

4.3 Computing the Phrase Substitutes 

The phrasal substitutes have been computed using the FIND_BEST_SUB()
function. This function takes in a single argument (the phrase p) and
returns a list of 2-tuples of the form $(q, s)$, where q is a candidate
replacement for p and q has a score of s according to the language model
being used. As mentioned in the previous section, a naive implementation of
this function using linear search and a distance function would result in
very high time complexity. So, an efficient implementation is essential.

This problem can be modeled as an Information Retrieval task. We can
treat each phrase in the language model as a document and the replacement
candidate as the query. Our aim is to find out which documents best match
the query and also have high scores. In order to compute the k-best list, we
follow a 2-step process:

1. From all the phrases in the language model, compute the top T best
phrases that match the replacement candidate.

2. From the T phrases computed in the previous step, compute the k
phrases having the best scores.

As in every Information Retrieval system, we need to first index the
documents. Each phrase in the language model is treated as an individual
document and assigned a docid. Next, an inverted index is constructed based
on the words present in the phrases. The inverted index is similar to the
index used in standard Information Retrieval systems. Corresponding to every
word, there is a postings list of the docids in which that word occurs. A
dictionary of words is also maintained. In order to facilitate
implementation, the words in the dictionary are stored in a prefix-tree
(trie), which significantly lowers the search complexity for the dictionary.

During the retrieval phase, we use a distance function between words in
order to match docids to the query. Let this distance function be
$D(w_1, w_2)$. The retrieval process can be formally expressed as follows:

Say the query q contains $W_q$ words $q_1, q_2, \ldots, q_{W_q}$.

Let us say that, by applying $D(q_i, w_j)$ for all $w_j$ in the dictionary,
we get the set $S_q = \{s_1, s_2, \ldots, s_x\}$ of words that are similar to
the words in q.

Let the postings list for each $s_j$ in $S_q$ be represented by $Z_j$.

Then the list of documents which satisfy the query is given by
$Out_q = \bigcup_{j=1}^{x} Z_j$.
Thus, $Out_q$ is the list of phrases that have words which are in some way
similar to the words of the query replacement candidate phrase q.

In order to carry out the union in the previous step efficiently, the
docids in each postings list are maintained in sorted order. Such a
representation in the form of a sorted linked list gives us a union time
which is linear in the size of the two lists.
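The indexing and retrieval steps above can be sketched as follows. This is only an illustration under assumptions: a plain Python dictionary stands in for the trie-backed dictionary, Python lists stand in for linked postings lists, and `same_prefix` is a hypothetical placeholder for the thresholded word distance $D$.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def build_index(phrases: List[str]) -> Dict[str, List[int]]:
    """Map each word to the sorted list of docids of phrases containing it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for docid, phrase in enumerate(phrases):
        for word in set(phrase.split()):
            index[word].append(docid)       # docids arrive in increasing order
    return index


def merge_union(a: List[int], b: List[int]) -> List[int]:
    """Union of two sorted postings lists in time linear in len(a) + len(b)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def retrieve(query: str, index: Dict[str, List[int]],
             similar: Callable[[str, str], bool]) -> List[int]:
    """Out_q: union of the postings lists of all words similar to a query word."""
    result: List[int] = []
    for q in query.split():
        for w in list(index):
            if similar(q, w):
                result = merge_union(result, index[w])
    return result


def same_prefix(a: str, b: str) -> bool:
    """Placeholder similarity test (assumption): equal first five characters."""
    return a[:5] == b[:5]


phrases = ["european union", "the european right", "right to say", "extremist wings"]
index = build_index(phrases)
print(retrieve("europe right", index, same_prefix))
```

Keeping every postings list sorted is what allows `merge_union` to advance two cursors in a single pass instead of comparing every pair of docids.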

Once we get the list of phrases which have words similar to the words in
the replacement candidate phrase, we can do a simple linear scan to find
out the best T phrases from this list. This is because, in practice, we have
$|Out_q| \ll M$, where M denotes the number of phrases in the language
model. We use a phrase level distance function for computing the T-best
list. A number of facts were taken into consideration while defining this
distance function. The distance function is a linear combination of different
components catering to different possible sources of error. They are:

1. Orthographic Errors, like typographical errors etc. 

2. Word errors, for example, using words out of context or at unsuitable
positions.

3. Reordering based errors, most commonly seen in the case of Machine 
Translation from Free Word Order languages to Fixed Word Order 
languages. 

The different components of the distance function are as follows: 

1. Levenshtein distance [2T] between individual words ($f_1$), which models
the orthographic errors.

2. Synset Distance in Wordnet ($f_2$), which models the incorrect use of
synonyms in sentences.

3. Word order based metrics ($f_3$), which model the errors created by
different word order in different languages.

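The distance function ($f_d$) can be represented as a linear combination of
the above metrics:

$$f_d = \lambda_1 f_1 + \lambda_2 f_2 + \lambda_3 f_3$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ denote the weights of the
linear combination.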

Each of these metrics is explained in detail in the following subsections. 

4.3.1 Levenshtein distance between individual words 

Levenshtein Distance is a commonly used metric in tools like spell checkers.
It is commonly known as the Edit Distance algorithm. It is used to find out
the distance between two words. Levenshtein distance measures the minimum
number of changes required in order to transform one word into another. For
this, it permits the following operations:

1. Insertion: It refers to inserting a new character to transform the word.
For example, cat → cart. In this, r is inserted between a and t in the
original word.

2. Deletion: This operation is essentially the reverse of insertion, where
we delete an already present character in the original word to form a
new word. For example, cart → cat. In this, the r between a and t is
deleted in the original word.

3. Substitution: This operation involves the replacement of one character
in the original word with another character. For example, cart → cast.
Here, the r in the source word is replaced by s. This operation can
be seen as a combination of insertion and deletion. The previous example
can be expressed as an insertion followed by a deletion, like
cart → carst → cast, or as a deletion followed by an insertion, like
cart → cat → cast. However, we treat it as a separate operation in order to
lower the edit distance value.
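The three operations lead to the standard dynamic programming recurrence for edit distance. The sketch below is a textbook formulation with unit cost for each operation, which is an assumption, since the operation costs are not stated explicitly here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


for source, target in [("cat", "cart"), ("cart", "cat"), ("cart", "cast")]:
    print(source, target, levenshtein(source, target))
```

Counting substitution as a single operation is what makes cart → cast cost 1 rather than 2.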

The Levenshtein distance measure is used in the algorithm at two stages:

1. Firstly, this metric is used in order to fetch the phrases which seem
relevant to the replacement candidate. This is used as a part of the
distance function $D(w_1, w_2)$. Say, for some word $q_i$ of the query q and
some $w_j$ in the dictionary, we have $D(q_i, w_j) < D_t$ ($D_t$ is a
predetermined threshold). Then we include $w_j$ in $S_q$.

2. Secondly, this metric is used in the computation of the k-best list of
replacements as part of the functions $f_1$ and $f_3$. The details are as
follows:

(a) As part of $f_1$:

Say we are given a phrase $P = p_1, p_2, \ldots, p_n$ and a possible
replacement for that phrase $R = r_1, r_2, \ldots, r_m$. We compute the
Levenshtein distance between each $(p_i, r_j)$ pair. We normalize this value
by the maximum of the lengths of $p_i$ and $r_j$.
Now we define the score for the phrase R as the minimum normalized edit
distance of a word in P to the words of R, summed over all the words in P.
Formally speaking,

$$f_1 = \sum_{p \in P} \min_{r \in R} \frac{\mathrm{LevenshteinDistance}(p, r)}{\max(\mathrm{len}(p), \mathrm{len}(r))}$$

(b) As part of $f_3$:

In the function $f_3$, the Levenshtein distance metric is used in order to
preprocess the phrases before computing the actual word order based distance
metrics. The preprocessing phase performs an alignment between the words in
the replacement candidate and the target phrase. As in the previous case, let
us consider P to be the phrase to be replaced and R to be a candidate
replacement phrase. A word $p \in P$ is said to be aligned to a word
$r \in R$ if the Levenshtein distance between p and r is less than some
threshold. In this phase, we perform a one-to-one alignment, with ties broken
arbitrarily. That is, if two words in P get aligned to a single word in R, we
choose only one of the words of P to be aligned and ignore the other one.
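Both uses of the Levenshtein distance can be sketched briefly. The code below is an illustration under assumptions: phrases are lists of words, the alignment is a simple greedy pass over the words of P (one possible way of breaking ties), and the threshold value is chosen only for the example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized(p: str, r: str) -> float:
    """Edit distance divided by the length of the longer word."""
    return levenshtein(p, r) / max(len(p), len(r))


def f1(P: List[str], R: List[str]) -> float:
    """Sum over words of P of the smallest normalized distance to any word of R."""
    return sum(min(normalized(p, r) for r in R) for p in P)


def align(P: List[str], R: List[str], threshold: int) -> List[Tuple[int, int]]:
    """Greedy one-to-one alignment: each word of R is used at most once."""
    used, pairs = set(), []
    for i, p in enumerate(P):
        options = [(levenshtein(p, r), j) for j, r in enumerate(R)
                   if j not in used and levenshtein(p, r) < threshold]
        if options:
            _, j = min(options)             # closest still-unaligned word of R
            used.add(j)
            pairs.append((i, j))
    return pairs


P, R = ["cart", "dog"], ["cat", "dog"]
print(f1(P, R))
print(align(P, R, threshold=2))
```

In this example "cart" contributes 1/4 (one edit over a maximum length of four) and "dog" contributes 0, and both words find a partner in R.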

4.3.2 Synset Distance in Wordnet 

Almost all languages have an extremely rich vocabulary. As a result, it
becomes almost impossible for a single person to remember all words in a
language. So, it is quite often the case that words are used in an incorrect
fashion. For example, a word is often used in a sentence where one of its
synonyms, or a word closely related to it, should have been used in its
stead. Let us consider the following sentence:

He was a one game surprise.

However, a more correct and semantically apt sentence would be:
He was a one game wonder.

Such inaccuracies can be taken care of only when the words in question
are replaced by more suitable words. That is where the synset based word
matching part of the distance function helps us compute better replacements
for a word in a phrase.

In order to compute this metric, we use the Wordnet [15]. WordNet is a 
lexical database for the English language. It groups English words into sets 
of synonyms called synsets, provides short, general definitions, and records 
the various semantic relations between these synonym sets. The purpose is 
twofold: to produce a combination of dictionary and thesaurus that is more 
intuitively usable, and to support automatic text analysis and artificial in- 
telligence applications. The database and software tools have been released 
under a BSD style license and can be downloaded and used freely. The 
database can also be browsed online. 

Each synset in the wordnet consists of words having similar semantics, or 
in short, synonyms. The wordnet also supports a number of other relation- 
ships between synsets. In fact, we can view it as a graph where the vertices 
are the synsets and the edges are the relationships among the synsets. Some 
of the relationships supported by the Wordnet are: 

1. Hyponymy 

2. Hypernymy 

3. Meronymy 

4. Holonymy 

In this work, only the synonyms have been used. As before, let us consider
the replacement candidate to be P and a possible replacement to be R. For
each word in P, we check if there is any word in R which belongs to the same
synset. We define a Boolean function $g(p, R)$ which takes the value 1 if
there is any $r \in R$ belonging to the same synset as p in the Wordnet, and
0 otherwise. The function $f_2$ is defined as the sum of the function
$g(p, R)$ over all words $p \in P$, normalized by the length of P. Therefore,
we have

$$f_2 = \frac{\sum_{p \in P} g(p, R)}{\mathrm{len}(P)}$$
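The synset based component can be sketched as follows. To keep the example self-contained, a tiny hand-written list of synsets (`TOY_SYNSETS`, an assumption) stands in for WordNet; a real implementation would query the WordNet database instead.

```python
from typing import List, Set

# Hypothetical miniature synset inventory standing in for WordNet.
TOY_SYNSETS: List[Set[str]] = [
    {"surprise", "wonder", "marvel"},
    {"game", "match"},
]


def g(p: str, R: List[str], synsets: List[Set[str]]) -> int:
    """1 if some word of R shares a synset with p, and 0 otherwise."""
    return int(any(p in s and r in s for s in synsets for r in R))


def f2(P: List[str], R: List[str], synsets: List[Set[str]]) -> float:
    """Fraction of the words of P that have a synset partner in R."""
    return sum(g(p, R, synsets) for p in P) / len(P)


P = "he was a one game surprise".split()
R = "he was a one game wonder".split()
print(f2(P, R, TOY_SYNSETS))
```

With this toy inventory only "game" and "surprise" find a partner, so two of the six words of P are matched; function words such as "he" score 0 here simply because the toy inventory does not list them.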

4.3.3 Word order based metrics 

Word order becomes an important factor, especially in translation tasks. For
example, in a language like Bengali or Hindi, the order of words is
Subject-Object-Verb, whereas in English it is Subject-Verb-Object. So, a
common problem in translation systems is the lack of a good reordering model.
Also,
when non-native speakers or new learners of a language try to write sentences 
in the language, word order errors are some of the more commonly observed 
errors. Hence, this is a problem worthy of consideration. And that is why 
this factor in the distance function is extremely important. If this factor is 
not kept, phrases might be chosen which have an incorrect word order but 
higher overall score. This factor takes care of the fact that correct word order 
is preserved when selecting replacement phrases. 

In this work, two types of functions have been used for word order preserva- 
tion: 

1. Rigid Word Order: As mentioned in the subsection on Levenshtein
distance, first a word level alignment is performed between the phrases
P and R. Next, it is checked whether the words in R aligned to P are in
the same order as the corresponding words in P. The phrase R is
considered for replacement if and only if the aligned words are in the
same order in both P and R. Otherwise, the phrase R is discarded.

2. Flexible Word Order: The rigid word order metric does not always
work. In fact, a certain degree of flexibility is required so that word
order related problems can be taken care of. But it must be ensured
that the flexibility is not absolute; otherwise, it might lead to the
introduction of more noise in the sentences. In this case too, we consider
the previous model where a phrase R is a possible replacement candidate
for a phrase P. Using the word based Levenshtein distance algorithm,
first an alignment is computed. Each word pair in the phrases is tagged.
Next, either one of these two metrics is computed:

(a) Length of Longest Common Subsequence: The length of the Longest
Common Subsequence (LCS) is computed between the aligned
words in R and P. This value is normalized by the number of
aligned words and that gives us the value of $f_3$.

(b) Number of Inversion Pairs: Given an array $A = [A_1, A_2, \ldots, A_n]$,
A[i] and A[j] are said to be an inversion pair iff i > j but A[i] < A[j].
In this method too, the words are aligned and the number of
inversion pairs in R is computed. The inverse of the number of
inversion pairs gives the value of $f_3$.

Let us consider an example:
Say, $R = r_1, r_2, r_3, r_4$

And, $P = p_1, p_2, p_3, p_4, p_5$

Say, after the alignment phase, we have $p_1$ aligns to $r_2$, $p_2$ aligns
to $r_4$ and $p_4$ aligns to $r_3$.

So, the number of aligned pairs is 3.

Let us tag the pairs with numbers and ignore the rest of the words for
the time being. So, the phrases become:
P = 1,2,3
R = 1,3,2

If we consider the LCS metric, the value will be $f_3 = \frac{2}{3}$, as 1,3
is a possible Longest Common Subsequence of P and R.
If we consider the Inversion metric, the value will be
$f_3 = \frac{1}{1} = 1$, as 3,2 is the only inversion pair in R.
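Both flexible word order metrics can be checked on the tagged example above with a short sketch. The function names are hypothetical, and returning `None` when there are no inversion pairs is an assumption, since the inverse is undefined in that case and the text does not say how it is handled.

```python
from typing import List, Optional


def lcs_length(a: List[int], b: List[int]) -> int:
    """Length of the longest common subsequence of two tag sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def inversion_pairs(seq: List[int]) -> int:
    """Number of position pairs whose tags appear in decreasing order."""
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])


def f3_lcs(p_tags: List[int], r_tags: List[int]) -> float:
    """LCS length normalized by the number of aligned words."""
    return lcs_length(p_tags, r_tags) / len(p_tags)


def f3_inversions(r_tags: List[int]) -> Optional[float]:
    """Inverse of the inversion count; None when there are no inversions."""
    count = inversion_pairs(r_tags)
    return 1 / count if count else None


p_tags, r_tags = [1, 2, 3], [1, 3, 2]
print(f3_lcs(p_tags, r_tags), f3_inversions(r_tags))
```

A quadratic inversion count is sufficient here because phrases are only a few words long.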

4.4 A Representative Example

Let us consider the following sentence:
the europe extreme right in is characterized by its and its use
of immigration as differences the issue of.

We can see that there are a number of errors in this sentence. For example: 

1. The phrase europe extreme right should become european extreme right 
or extreme right of europe. 

2. The word in should not be there after right. 

3. In the phrase its and its use, there should either be a word after the
first its, or the phrase should be its use.

4. The phrase differences the issue of makes no sense. This phrase should 
be reordered or replaced by something else. 

Let us now see how the algorithm works on this sentence. 

Firstly, it calls on the function FIND_BEST_SUB() in order to compute the 
best replacements for all possible phrases. We have used the 5 best replace- 
ments in this experiment. A snapshot of the replacement set of some of the 
more significant phrases is given below along with the scores. Each replace- 
ment is written as a (phrase,score) pair. 

SUB(europe) = {(europe, 5.9865), (european, 5.9012), (european union,
4.5213), (the european life, 2.3242), (european governments, 2.2312)}

SUB(extreme) = {(extreme, 1.2121), (extremist, 0.9821), (extremist wings,
0.5632), (the islamic extremists, 0.2180), (fundamentalists and extremists,
0.1134)}

SUB(europe extreme) = {(european, 5.1212), (european extreme, 4.9212),
(extremist threat, 4.2911), (threat to europe, 3.2931), (terrorism in europe,
3.1023)}

SUB(europe extreme right) = {(european extreme right, 4.2313), (extreme
europe right, 4.0981), (european right wing, 3.5673), (european liberal
politics, 3.2342), (the european union, 3.2312)}

SUB(extreme right in) = {(extreme right is, 3.4562), (extreme right,
3.2576), (rightist extremists are, 2.7853), (in the extremist propaganda,
1.0123), (right to say, 0.4323)}

SUB(its and its use) = {(its use, 5.2142), (that it uses, 4.2123), (its
diplomacy and its, 4.0123), (that it had used, 3.1463), (they had been used,
2.6432)}

SUB(as differences the) = {(make good the differences, 2.1231), (as the
differences, 1.4654), (differences of the, 1.2435), (different from the usual,
0.3461), (different issues in the, 0.2423)}

SUB(differences the issue of) = {(issue of the differences, 2.3242),
(address the issues of differences, 2.1982), (address the matter of, 1.5692),
(the differences among the, 1.2342), (the problems of different, 0.8734)}

The length of the sentence is N = 19. So, the algorithm creates a 19 x 19
matrix and fills up its diagonal and the upper triangle. First it fills up the
diagonal for all the unigrams in the sentence, then all the bigrams, and so
on. In order to see how this algorithm actually works, let us consider the
phrase the europe extreme right in. This phrase can be broken down in
the following ways:

1. the + europe extreme right in

2. the europe + extreme right in

3. the europe extreme + right in

4. the europe extreme right + in
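The four decompositions listed above are simply the break points of a five-word phrase. A minimal sketch, with a hypothetical helper name, enumerates them:

```python
from typing import List, Tuple


def binary_splits(words: List[str]) -> List[Tuple[str, str]]:
    """All ways to cut a phrase into a non-empty left part and right part."""
    return [(" ".join(words[:cut]), " ".join(words[cut:]))
            for cut in range(1, len(words))]


splits = binary_splits("the europe extreme right in".split())
for left, right in splits:
    print(f"{left} + {right}")
```

A phrase of m words always has m - 1 such break points, which is the range covered by the innermost span loop of the dynamic programming algorithm.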

After the lower rows of the table are filled up, the cells corresponding to
the phrases the europe and extreme right in look as follows:

REP(the europe) = {(the europe, 12.2341), (the european, 12.1914), (the
the european union, 10.1231), (not in the european union, 9.1231), (the
european governments, 6.2114)}

REP(extreme right in) = {(extreme right is, 3.4562), (extreme right, 3.2576),
(rightist extremists are, 2.7853), (in the extremist propaganda, 1.0123),
(right to say, 0.4323)}

Combining these two, we get 25 candidates for replacing the phrase the 
europe extreme right, hke: 

1. (the europe extreme right is, 16.0012) 

2. (the europe extreme right, 15.8212) 

3. (the europe right extremists Eire, 15.1231) 

4. (the europe in the extremist propaganda, 14.2111) 

5. (the europe right to say, 12.9015) 

6. (the european extreme right is, 16.2918) 

7. (the european extreme right, 15.9210) 

8. (the european rightist extremists are, 15.0012) 

9. (the european in the extremist propaganda, 13.5681) 

10. (the european right to say, 13.0001) 

11. (the the european union extreme right is, 13.6221) 

12. (the the european union extreme right, 13.5512) 

13. (the the european union rightist extremists are, 12.9899) 

14. (the the european union in the extremist propaganda, 11.5101) 

15. (the the european union right to say, 10.7014)

16. (not in the european union extreme right is, 12.8213) 

17. (not in the european union extreme right, 12.5612) 

18. (not in the european union rightist extremists are, 12.5235) 

19. (not in the european union in the extremist propaganda, 11.2322)

20. (not in the european union right to say, 9.4534)

21. (the european governments extreme right is, 10.9219)

22. (the european governments extreme right, 10.6212)

23. (the european governments rightist extremists are, 9.0028)

24. (the european governments in the extremist propaganda, 7.5422)

25. (the european governments right to say, 6.8921)



The 5 highest-scoring entries in the above list (entries 6, 1, 7, 2 and 3)
are kept as the best 5 selections for the candidate phrase the europe
extreme right in. The same process is used to fill up the whole table REP.
The best candidate replacement chosen for the entire sentence turns out to be:

The European extreme right is characterized by its use of
immigration as the issue of differences.

As we can see, a number of changes have been made in the original
sentence, like:

1. europe extreme right → european extreme right

2. its and its use → its use

3. differences the issue of → the issue of differences

Quite clearly, the final sentence is more correct than the original sentence. 
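
The table-filling step described in this section can be sketched in Python as
follows. This is only an illustrative sketch under stated assumptions, not the
implementation used in the experiments. The names combine, fill_table, direct
and score are hypothetical. score stands in for the language model score of a
phrase: the scores in the example above are not the sum of the scores of the
two parts, so the joined phrase is assumed to be scored afresh. direct stands
in for the lookup of replacement candidates for a whole span.

```python
import heapq
from itertools import product


def combine(left, right, score, k=5):
    """Cross two candidate lists and keep the k best joined phrases."""
    pool = {}
    for (l_phrase, _), (r_phrase, _) in product(left, right):
        phrase = f"{l_phrase} {r_phrase}"
        pool[phrase] = score(phrase)  # assumption: re-score the joined phrase
    return heapq.nlargest(k, pool.items(), key=lambda item: item[1])


def fill_table(words, direct, score, k=5):
    """Fill the diagonal and upper triangle, shortest spans first.

    table[(i, j)] holds the k best replacements for words[i:j].
    """
    n = len(words)
    table = {}
    for length in range(1, n + 1):        # unigrams, then bigrams, and so on
        for i in range(n - length + 1):
            j = i + length
            span = " ".join(words[i:j])
            pool = {span: score(span)}    # leaving the span unchanged is allowed
            pool.update(direct(span))     # hypothetical whole-span candidates
            for split in range(i + 1, j): # every two-way split of the span
                pool.update(combine(table[(i, split)], table[(split, j)], score, k))
            table[(i, j)] = heapq.nlargest(k, pool.items(), key=lambda item: item[1])
    return table
```

With k = 5, each call to combine considers the 5 x 5 = 25 joined phrases of
one split and keeps the best 5, which mirrors the worked example above.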

4.5 Experiments and Results 

The variable part of the correction system described previously was the
distance function used to find suitable replacement candidates for the
phrases. We conducted a number of experiments using several combinations of
these distance functions. The combinations used are tabulated in Table 4.1.

Function Code   f1                     f2                f3
A               Levenshtein Distance   Synset Distance   -
B               Levenshtein Distance   Synset Distance   Rigid Word Order
C               Levenshtein Distance   Synset Distance   LCS Length
D               Levenshtein Distance   Synset Distance   Number of Inversion Pairs

Table 4.1: The Distance functions used for experimentation
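
Three of the components listed in Table 4.1 are standard sequence measures,
and a minimal Python sketch of them is given here for reference. It assumes
the textbook definitions: Levenshtein distance, longest common subsequence
(LCS) length over words, and the number of word pairs whose relative order is
inverted with respect to the original phrase. The function names are
illustrative, the inversion count assumes that each word occurs only once in
the phrase, and neither the Synset Distance nor the Rigid Word Order check is
reproduced.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def inversion_pairs(original, candidate):
    """Number of word pairs whose order in candidate is reversed.

    Assumes every word of original is distinct; words of candidate that do
    not occur in original are ignored.
    """
    rank = {word: i for i, word in enumerate(original)}
    order = [rank[word] for word in candidate if word in rank]
    return sum(1
               for i in range(len(order))
               for j in range(i + 1, len(order))
               if order[i] > order[j])
```

For instance, levenshtein("precsely", "precisely") is 1, and reordering
racially homogeneous communities into homogeneous communities racially gives
an LCS length of 2 and 2 inversion pairs.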



Experiments were performed using the Europarl Corpus. The languages
chosen were German and English. Since we did not have a sufficient amount
of noisy data at our disposal, we used a poorly tuned Statistical Machine
Translation system to obtain the incorrect sentences. The source
language was German and the target language was English. The correction
was done on the English sentences. To make the task more difficult,
a number of errors were manually introduced into the incorrect English
sentences, further increasing the amount of noise present.



The system is the same as explained in Section 2.4. In the implementation
of the distance functions, there is one parameter that needs to be fixed.
It is in the implementation of the Levenshtein Distance function, where a
value of $D_t = 3$ has been used. This value ensures that the substitute word
chosen is neither too different from, nor too similar to, the word in
question. We have also given equal weight to all the components of the
distance function, that is, $a_i = 0.25$ for all $1 \le i \le 4$. Table 4.2
sums up the values of Perplexity and BLEU score for the different choices of
the distance function.

             Original     Corrected Sentences
             Sentences    Function A   Function B   Function C   Function D
Perplexity   20.419       19.064       18.716       18.656       18.658
BLEU         0.662        0.662        0.663        0.665        0.664

Table 4.2: Summary of Results over all the used distance functions

In order to analyze the performance of our algorithm on Machine Translation,
we performed another set of experiments. For this purpose, we did not use a
separate Machine Translation system, but used the Google Translator instead.
The corpus chosen was the FIRE 2008 ad-hoc retrieval corpus. First, we
selected 100 sentences from the English corpus. We translated those into Hindi
using the Google Translator system and translated the Hindi sentences back
into English using Google Translator again. Since English-Hindi translation
in Google Translator does not give very good results, a lot of noise was
introduced into the sentences. Let us take a look at some of the sentences
in order to understand the amount of noise introduced:

1. Original: Specifically CII has appreciated the special initiatives for 
agriculture, gems and jewellery, handlooms, leather and footwear. 

Final: CII agriculture particularly gems and jewelry, handicraft, leather
and shoes for the special initiative is appreciated.

2. Original: Indian Chamber of Commerce president Anup Singh said.
The special package for agriculture and schemes like Vishesh Krishi 
Upaj Yojna will boost exports of fruits, vegetables and their value- 
added products. 

Final: Indian Chamber of Commerce president, said Anoop Singh, Vishesh
agricultural planning and agricultural schemes like the special packages
for fruits, vegetables and their value-added products will boost exports.

Google Translator: http://translate.google.com/

FIRE (Forum for Information REtrieval) is a conference conducted by ISI Kolkata.
http://www.isical.ac.in/~clia/

                    Perplexity   BLEU
Before Correction   27.521       0.557
After Correction    16.121       0.569

Table 4.3: Results of experiments with twice translated sentences taken from
the FIRE corpus

3. Original: GMG plans to fly thrice a week between the south-eastern
port city of Chittagong and Calcutta.

Final: GMG to fly three times a week southeastern port city
Chittagong and Kolkata between plans.

4. Original: GMG hopes to be allowed to fly to other Indian cities
connecting capital Dhaka with Delhi and Mumbai.

Final: GMG to other Indian cities Mumbai and Delhi to Dhaka link
is expected to be allowed to fly.

5. Original: The government recently signed a free trade pact with
Thailand, which it hopes will ultimately cover all nations in the region.

Final: The government recently Thailand, which it hopes will
eventually cover all countries in the region with a free trade treaty signed.

As is obvious, the final sentences contain a lot of errors, the most common
being global clause reordering. Unfortunately, our system is not built to
handle this problem. We used our noise correction algorithm with Distance
Function C on these sentences. We used the entire FIRE 2008 ad-hoc
English corpus for computing our language model. The results are summarized
in Table 4.3. Let us examine some of the output sentences from this
experiment:

1. CII agriculture particularly gems and jewelry handicraft leather and
shoes for the special initiative is appreciated. → CII in agricultural
particularly gems and jewelry handicraft leather and shoes special
initiative has been appreciated.

2. Indian Chamber of Commerce president said Anoop Singh Vishesh
agricultural planning and agricultural schemes like the special packages for
fruits vegetables and their value added products will boost exports.
→ Indian Chamber of Commerce president Anoop Singh said Vishesh
agricultural planning and agricultural schemes like the special packages
for fruits vegetables and their value added products will boost revenue
from exports.

3. GMG to fly three times a week southeastern port city Chittagong and
Kolkata between plans. → GMG to fly three times a week between
southeastern port city Chittagong and Kolkata.

4. GMG to other Indian cities Mumbai and Delhi link is expected to be
allowed to fly. → GMG to other Indian cities like Mumbai and Delhi
link is expected to be permitted to fly.

5. The government recently Thailand, which it hopes will eventually cover
all countries in the region with a free trade treaty signed. → The
Thailand government now hopes will eventually cover all countries in
the region with a signed free trade treaty.

As we can see from the above examples, the global reordering problem has not 
been solved. But the algorithm has indeed made local changes to clauses and 
phrases and the output sentences are indeed better than the input sentences 
in terms of fluency. The semantics have not always been preserved, but that 
is not what we have aimed to do anyway. 
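
For reference, the perplexity values reported in Tables 4.2 and 4.3 can be
related to language model probabilities as in the sketch below. It assumes the
usual definition of perplexity as the exponential of the negative average
log-probability per word. The exact normalization applied by the language
modelling toolkit in these experiments is not stated here, and the function
names are illustrative.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log word probabilities ln p(w_i | history)."""
    if not log_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(before, after):
    """Fraction by which a score dropped, e.g. perplexity after correction."""
    return (before - after) / before
```

Applied to Table 4.3, relative_reduction(27.521, 16.121) is about 0.41, that
is, the correction lowers the perplexity of the twice translated sentences by
roughly 41%, while BLEU moves only from 0.557 to 0.569.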

4.6 Analysis 

The Dynamic Programming based correction model succeeds in improving
the fluency of sentences quite significantly. This algorithm is quite similar
to the standard decoding algorithms, where, given a sentence, one is supposed
to find a sequence of phrasal segmentations and, using the phrase table, find
a set of substitutes which would maximize the a-posteriori probability of the
sentence being generated. In this algorithm, since we do not have any
explicit phrase table, we implement it by searching for nearly matching
phrases in the Language Model itself. The distance function helps figure out
what is a good match. This algorithm does not use a global distortion model,
as is done in most Statistical Machine Translation systems. The rationale
behind this is that errors in written text do not normally involve
long-distance reordering; rather, most of the wrongly written words
occur within a small window. Our test set was designed so that it contained
sentences having all the types of errors mentioned in Section 2.2. Some types
of errors were obtained directly from the translation output, while some were
introduced manually.
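
The idea of using the language model itself in place of a phrase table can be
illustrated with a small sketch. Here the language model is assumed to be
available as a plain dictionary from n-gram phrases to scores, and near
matches are found with a word-level edit distance. Both the data structure and
the names near_matches and word_edit_distance are assumptions made for
illustration; the distance functions actually used are those of Table 4.1.

```python
import heapq


def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def near_matches(phrase, ngram_scores, max_distance=2, k=5):
    """Return the k best-scoring n-grams within max_distance of phrase.

    ngram_scores is a hypothetical stand-in for the language model:
    a dict mapping each n-gram (as a string) to its score.
    """
    words = phrase.split()
    close = [(cand, s) for cand, s in ngram_scores.items()
             if word_edit_distance(words, cand.split()) <= max_distance]
    return heapq.nlargest(k, close, key=lambda item: item[1])
```

A tighter max_distance keeps the substitutes close to the input phrase, at the
cost of missing corrections that need several word changes.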



A quick glance at Table 4.2 shows that the order of words in the chosen
replacement phrases is extremely important: constraining it lowers the
average perplexity of the set of sentences from 19.064 (Distance Function A)
to between 18.656 and 18.716 (Distance Functions B, C and D). Although
Distance Function A lowers the perplexity of the test set, indicating that
fluency has indeed improved, the lower perplexities obtained with Distance
Functions B, C and D imply that total freedom of word order may not
be the way to go. This is quite intuitive, since English is
not a Free Word Order language.



A look at the BLEU scores shows that they have not changed enough
to draw a conclusion about the algorithm's effect on the faithfulness
of the sentences. However, a comparison between the BLEU scores and
the Perplexity values shows that the BLEU score follows the same trend as
the perplexity. Also, Distance Functions C and D have resulted in
almost identical values of perplexity and BLEU, showing that using either
of them would give us a good enough result.

Now let us look at some sentences that were corrected by this algorithm 
and try to analyze the errors that each distance function corrects: 

1. The parties dominant of the center left and center right have faced 
prospect this in the policy of the ostrich applying. 
Distance Function A: The parties dominant of the center left and
center right have faced this prospect in the applying principle of the
ostrich.

Distance Function B: The dominant parties of the center left and
right have faced this prospect in the policy of the ostrich applying.

Distance Function C: The dominant parties of the center left and
right have faced this prospect in the policy of the ostrich applying.

2. This is precsely the aim pursued largely by research teams in economics, 
sociology, psychology and pol science in the United States. 
Distance Function A: That precisely is the aim largely pursued by 
teams of research in economics, sociology, physiology and political sci- 
ence in the United States. 

Distance Function B: This is precisely the aim pursued largely by 
teams of research in economics, sociology, psychology and political sci- 
ence in the United States. 

Distance Function C: This is precisely the aim pursued largely by 
teams of research in economics, sociology, psychology and political sci- 
ence in the United States. 

3. This does not mean that it should be eliminated heterogeneity and cre- 
ate racially homogeneous communities. 

Distance Function A: This does not mean that it should eliminate
heterogeneity and create homogeneous communities racially.

Distance Function B: This does not mean that should eliminate
heterogeneity and create racially homogeneous communities.

Distance Function C: This does not mean that should eliminate
heterogeneity and create racially homogeneous communities.

4. Other protest against action affirmative and maintains that a policy 
that about race does not care is sufficient. 

Distance Function A: others protest against an affirmative action and
maintains that a policy that about race does care is sufficient.

Distance Function B: Others protest against affirmative action and
maintains that a policy that does not about race care is sufficient.

Distance Function C: Others protest against affirmative action and
maintains that a policy that does not about race care is sufficient.

5. Obviously, minorities have made progressed towards more integration 
and economic success. 

Distance Function A: Obviously minorities have made progress
towards more economic success and integration.

Distance Function B: Obviously, minorities have made progress
towards more integration and economic success.

Distance Function C: Obviously, minorities have made progress
towards more integration and economic success.

6. It will not be the case, as is clear the racial history of America. 
Distance Function A: It will not be the case, as is clear in the racial
history of America.

Distance Function B: It will not be the case, as is clear in the racial
history of America.

Distance Function C: It will not be the case, as is clear in the racial
history of America.

As is apparent from the above examples, Distance Functions B and C
behave almost identically, and the changes performed by these functions
are basically a superset of the changes performed by Distance Function A.
Distance Function A is able to correct substitution based errors, spelling
errors and missing word errors, but may fail to correct reordering based
errors. In fact, due to the unconstrained nature of the replacement phrases
chosen, it is likely to introduce errors which were not already present. For
example, in example 3, it changes racially homogeneous communities to
homogeneous communities racially. The adverb racially is supposed to qualify
the word homogeneous, but the output of the algorithm with Distance Function
A moves it, rendering the sentence incorrect. Such errors
might creep in due to the lack of constraints on ordering in this function.

All three distance functions have the Levenshtein Distance and Synset
Distance based metrics in common. As a result, they are able to correct the
remaining categories of errors. For example,

• Substitution based errors, like in example 4, other protest → others
protest

• Spelling errors, like in example 2, precsely → precisely

• Missing word errors, like in example 6, clear the racial history → clear
in the racial history

The downside of this method is that it is highly dependent on the corpus
being used, and is an extremely domain-dependent algorithm. The words used
in the sentences that are being corrected must bear a high degree of
correlation to the words used in the corpus. Otherwise, just in order to
increase the likelihood of a sentence according to the corpus, some phrases
might be substituted by completely unrelated phrases. Also, how this
technique will perform on an extremely free word order language like Sanskrit
is unknown, largely because of the absence of a suitable corpus in such a
language and our own lack of comprehension of the language. Thirdly, the
corpus used must be large enough for a good enough correction. A small corpus
will either miss out on possible sources of error or introduce new errors just
to make the sentence more likely according to the corpus.

This algorithm can find application in a number of areas. For example,
we can use it for correction of sentences output by machine translation
systems. If we have a small bitext, but a large enough monolingual corpus in
the target language, we can use this algorithm with the monolingual corpus
for postprocessing of the translation outputs. This algorithm can also be
used for domain specific correction of machine translation outputs. Suppose
we have a bitext that is generic (not adhering to a particular domain), but a
domain specific corpus in the target language. Once the machine translation
system generates the outputs, we can filter out the domain specific sentences
and use this algorithm on the monolingual corpus to correct those sentences
that belong to the particular domain. This method can also be used for
second language learning, by correcting the sentences generated by non-native
speakers of a particular language.
Chapter 5
Conclusion 



In this work, we present two algorithms to correct noisy sentences. One
performs reasonably well, whereas the other fails completely. The fixed length
phrase based algorithm tries to make too many corrections to the original text
and in the process actually increases the number of errors in the text. This
problem is rectified by the Dynamic Programming based approach, which
formulates the problem as a decoding task and uses a model
similar to phrase based decoding models in order to make corrections to the
text. This model, as seen in Chapter 4, is able to correct a number of errors
and increase the fluency of the sentences according to the language model
used. We have also suggested a number of applications of this algorithm, such
as correcting machine translation output and domain specific correction of
machine translation output. As an example, we have explored the machine
translation output correction application.

The primary contribution of the thesis is a method that is language
independent and relatively simple. Many of the methods explored previously
by researchers have the disadvantage of being either language dependent or
highly computation intensive. The Dynamic Programming based method is
neither very computation intensive nor language dependent.
Besides, it has quite low time and space complexities. What
distinguishes it from other approaches tried previously is the likeness of the
task to decoding. To the best of our knowledge, nobody has previously tried to
apply decoding algorithms to correct errors in sentences. Another major
contribution of this work is the varying set of distance functions. The
functions cover almost all types of errors that can possibly occur and can
feasibly be modeled computationally.

Much can be done to improve the performance of this algorithm and to
build upon the work done. Firstly, we have applied it to a domain specific
corpus. If a corpus can be constructed which incorporates data from
all domains, the results might be interesting to investigate. Whether this
algorithm can work with data from different domains taken together remains
to be seen. Another direction that may be followed by future researchers is
investigating how this algorithm works on different classes of languages and
whether the type of language has any bearing upon the distance function
being used. For example, it is likely that for a highly free word order
language like Sanskrit, Distance Function A might be the best choice because
it does not constrain the word ordering by any means. A third approach may
be to augment the set of Distance Functions in order to cover more errors.
How the algorithm performs under these conditions remains to be seen.